Abstract

We study the core fragment of the Elog wrapping language used in the 
Lixto system (a visual wrapper generator) and formally compare Elog to other 
wrapping languages proposed in the literature. 

1 Important Note 

The long version of the work first presented in the extended abstract ^U], which 
appeared in Proc. 21st ACM SIGMOD-SIGACT-SIGART Symposium on Principles 
of Database Systems (PODS 2002), Madison, Wisconsin, ACM Press, New York, 
USA, pp. 17 - 28, was split into two parts for journal publication. The first part 
[11] studies the expressive power of monadic datalog over trees and establishes the
connection to the monadic fragment of the visual wrapper language Elog. The 
second part, subject of this paper, studies and compares Elog to other practical 
visual wrapper languages. 

Currently, this paper should be understood as an addendum to [11] and cannot
be read as a standalone paper. We refer the reader to [11] for all notions whose
definitions are missing here.

In detail, this document extends the treatment of [11] by the following material:

• The capability of producing a hierarchically structured result is essential to
tree wrapping. We define the language Elog⁻₂ in order to make the creation of
complex nested structures explicit. Elog⁻₂ is basically obtained by enhancing
Elog⁻ with binary predicates in a restricted form, which allow us to represent
hierarchical dependencies between selected nodes in the fixpoint computation
of an Elog⁻ program. Elog⁻₂ is an actual fragment of the wrapping language
Elog used internally in the Lixto system [7], a commercial visual wrapper
generator.

• We take a closer look at two other tree-based approaches to wrapping HTML
documents. The first is the language of regular path queries (e.g., QEl) with
nesting. Regular path queries are considered essential to Web query languages
PQ, and by extending the language of regular path queries by capabilities for
producing nested output (and for restricting queries by additional conditions),
one obtains a useful wrapping language. We show that this formalism is
strictly less expressive than Elog⁻₂.

University of Edinburgh and was sponsored by an Erwin Schrödinger scholarship of the FWF.

• The second formalism that we compare to Elog⁻₂ is HEL [18], the wrapping
language of the commercially available W4F framework, which is the only
tree-based wrapping formalism besides Elog of which a formal specification
has been published. Again, we are able to show that HEL is strictly less
expressive than Elog⁻₂.

2 Binary Pattern Predicates and Paths with the
Kleene Star and Ranges: Elog⁻₂

In this section, we step out of our framework of unary information extraction
functions. We enhance Elog⁻ by a limited form of binary pattern predicates, which
allow us to explicitly represent the parent-child relationship of the tree computed
as a result of the wrapping process, but not more than that. This approach to
wrapping is basically a mild generalization of our wrapping framework based on
unary information extraction functions. The syntax of the full Elog language
employs binary pattern predicates in precisely the same way as shown below. The
subtle increase in expressive power will be needed in Section 3 when we compare
Elog with other practical wrapping languages. A further feature that we will need
in Section 3 is a way of specifying a path using a regular expression with the
Kleene star and a "range". We will call the new language obtained Elog⁻₂.

We mildly generalize the predicate $\mathrm{subelem}_\pi$ of Definition 6.1 in [11] to
support arbitrary regular expressions $\pi$ over $\Sigma$ (notably, including the Kleene
star). Again, $\mathrm{subelem}_\pi(v_0, v)$ is true if node $v$ is reachable from $v_0$
through a downward path labeled with a word of the regular language defined by $\pi$.

A range $\rho$ defines, given an integer $k$, a function that maps each $1 \le i \le k$ to
either 0 or 1. Given a word $w = w_1 \cdots w_k$, $\rho$ selects those $w_i$ that are mapped
to 1. A range applies to a set of nodes $S$ (written as $S[\rho]$) as follows. Let
$v_1 \cdots v_k$ be the sequence of nodes in $S$ arranged in document order. Then, $S[\rho]$
is the set of precisely those nodes $v_i$ for which $i$ is mapped to 1.
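
The application of a range to a node set can be illustrated with a small sketch. It assumes that the nodes are already given as a list in document order and that a range is modelled as a function `rho(k, i)` returning 0 or 1; the names `apply_range` and `first_and_last` are hypothetical and chosen only for this illustration.

```python
from typing import Callable, Hashable, Sequence, Set


def apply_range(nodes_in_doc_order: Sequence[Hashable],
                rho: Callable[[int, int], int]) -> Set[Hashable]:
    """Return S[rho]: the nodes whose 1-based position i is mapped to 1.

    rho receives the total number k of nodes and a position i (1 <= i <= k).
    """
    k = len(nodes_in_doc_order)
    return {v for i, v in enumerate(nodes_in_doc_order, start=1)
            if rho(k, i) == 1}


def first_and_last(k: int, i: int) -> int:
    """Example range (an assumption): select the first and the last node."""
    return 1 if i in (1, k) else 0


S = ["v1", "v2", "v3", "v4"]          # nodes in document order
selected = apply_range(S, first_and_last)
```

Because the range sees the length $k$, positions can be counted from either end of the sequence.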

Definition 2.1 Let $\pi$ be a regular expression over $\Sigma$ and let $\rho$ be a regular
expression in the normal form of Proposition 4.13 in [11] which defines a regular word
language of density one over the alphabet {0, 1}. The binary relation
$\mathrm{subelem}_{\pi,\rho}$ is defined as the set of all pairs of nodes $(v, v')$ such that
$v' \in S[\rho]$, where $S$ is the set of all nodes $v_0$ with $\mathrm{subelem}_\pi(v, v_0)$. □

The normal form of Proposition 4.13 in [11] is a convenient syntax for specifying
regular word languages of density one, which in turn allow us to elegantly assign a
unique word over the alphabet {0, 1} to a sequence of known length. Note, however,
that throughout the remainder of the paper, in all languages that we will discuss,
only much weaker forms of ranges will be required that can always be easily encoded
as regular expressions of this normal form. For example, a range of the form "i-th
to j-th node" (where i and j are constant) can be specified by a regular expression

\[ 0^{i-1}\, 1^{j-i+1}\, 0^{*} \]
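
As a rough illustration of how such a range expression picks out positions, the following sketch translates the "i-th to j-th node" range into a Python regular expression and enumerates the matching 0/1 words of a given length by brute force. The helper names are hypothetical. The sketch only covers sequences of length at least j; for shorter sequences this simplified expression matches no word at all, so it is not itself a density-one language in the strict sense.

```python
import re
from itertools import product


def range_regex(i: int, j: int) -> str:
    """Regex for 0^(i-1) 1^(j-i+1) 0^*, with 1-based positions i <= j."""
    return f"0{{{i - 1}}}1{{{j - i + 1}}}0*"


def words_of_length(pattern: str, k: int) -> list:
    """All words over {0, 1} of length k that match the pattern entirely."""
    candidates = ("".join(bits) for bits in product("01", repeat=k))
    return [w for w in candidates if re.fullmatch(pattern, w)]


def select(nodes: list, i: int, j: int) -> list:
    """Nodes (in document order) at the positions marked 1 by the unique word."""
    words = words_of_length(range_regex(i, j), len(nodes))
    if not words:                      # sequence shorter than j: nothing matches
        return []
    (word,) = words                    # exactly one word for lengths >= j
    return [v for v, bit in zip(nodes, word) if bit == "1"]


picked = select(list("abcde"), 2, 3)
```

The brute-force enumeration is exponential in the length and is only meant to make the "unique word per length" idea concrete.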

Lemma 2.2 The predicate $\mathrm{subelem}_{\pi,\rho}$ is definable in MSO over $\tau_{ur}$.

Proof. From the proof of Lemma 5.9 in [11] it is obvious how to define a monadic
datalog program which defines a predicate $S$ for the set of all nodes reachable
from a node $x$ distinguished by a special predicate. From this we obtain an MSO
formula $\varphi_\pi(x, S)$ with the obvious meaning using Proposition 3.3 in [11].

Let $\rho$ be the range definition, as a regular expression over the alphabet {0, 1} in
the normal form of Proposition 4.13 in [11]. We define an MSO formula $\varphi_\rho(S, Y)$
which is true if there is a word $w$ of length $|S|$ in the language $L(\rho)$ and $Y$ is the
set of nodes in $S$ that, when traversed in document order, are at a position which
is occupied by a "1" in $w$.

Let $\mathcal{P}_\rho$ be the program shown in the construction for down transitions of the
proof of Theorem 4.14 in [11], with a few modifications. Rather than on a list of
siblings, we try to match $\rho$ with the set $S$ put into document order. Thus, we have
to replace occurrences of "firstchild" and "nextsibling" with analogous relations
for navigating the document order $\prec$. For example, an atom nextsibling(x, y) is
replaced by $\psi_\prec(x, y)$ (using an input relation $S$), where $\psi_\prec$ is defined in MSO as

\[ \psi_\prec(x, y, S) := S(x) \wedge S(y) \wedge x \prec y \wedge (\neg\exists z)\, \bigl(S(z) \wedge x \prec z \wedge z \prec y\bigr). \]

That the document-order relation $\prec$ itself is MSO-definable is clear from its
definition as a caterpillar expression in Example 2.5 in [11], from Lemma 5.9 in [11],
and Proposition 3.3 in [11].

In the down-transition construction from the proof of Theorem 4.14 in [11], the
goal is to assign (state assignment) predicates that are actually the symbols of the
regular language to be matched. In the same way, the unary query that we are
interested in is the predicate "1" defined by program $\mathcal{P}_\rho$. The formula
$\varphi_\rho$ such that, given set $S$, $\varphi_\rho(S, Y)$ is true iff $Y$ is the set of all
nodes assigned "1" by $\mathcal{P}_\rho$ is obtained from $\mathcal{P}_\rho$ as described in the
proof of Proposition 3.3 in [11].

Now, it is easy to see that 

\[ \mathrm{subelem}_{\pi,\rho}(x, y) := (\forall S)(\forall Y)\, \bigl(\varphi_\pi(x, S) \wedge \varphi_\rho(S, Y)\bigr) \rightarrow y \in Y \]

indeed defines the desired relation. □ 

Remark 2.3 The previous proof makes it easy to extend the formalism to support
also the matching of the range backward (using the reverse document order relation
$\succ$ rather than $\prec$), and in particular selecting only the last element matching
path $\pi$ (using the range 1.0* and reverse document order). □

Now we are in the position to define the language Elog⁻₂.

Definition 2.4 Let Elog⁻₂ be obtained by changing the Elog⁻ language as follows.
All pattern predicates are now binary and all rules are of the form

\[ p(x_0, x) \leftarrow p_0(\_, x_0),\ \mathrm{subelem}_{\pi,\rho}(x_0, x),\ C,\ R. \]

where $\mathrm{subelem}_{\pi,\rho}$ is the predicate of Definition 2.1, $C$ is again a set of
condition atoms as for Elog⁻ but "contains" is now equivalent to
$\mathrm{subelem}_{\pi,\rho}$ (permitting ranges and paths defined by arbitrary regular
expressions), and $R$ is a set of pattern atoms of the form $p_i(\_, x_i)$. The underscore
is a way of writing a variable not referred to elsewhere in the rule. The predicate
"root" is also pro-forma binary and can be substituted as a pattern predicate. 1 □

The meaning of a binary pattern atom $p(v_0, v)$ is that node $v$ is assigned
predicate $p$ and the inference was started from a parent pattern at node $v_0$. We
define unary queries in Elog⁻₂ in the natural way, by projecting away the first
argument positions of our binary pattern predicates. For instance, a program
$\mathcal{P}$ (containing a head predicate $p$) defines the unary query
$Q_p := \{x \mid (\exists x_0)\ p(x_0, x) \in \mathcal{T}^\omega_{\mathcal{P}}\}$ based on $p$.

1 As in Elog⁻, we need "root" as a parent pattern "to start with".
Theorem 2.5 A unary query is definable in Elog⁻₂ iff it is definable in MSO.

Proof. Let $\mathcal{P}$ be an Elog⁻₂ program and let $\mathcal{P}'$ be the program obtained
from $\mathcal{P}$ by adding a rule

\[ p'(x) \leftarrow p(x_0, x). \]

for each pattern predicate appearing in $\mathcal{P}$. It is easy to show by induction on the
computation of $\mathcal{T}^\omega_{\mathcal{P}}$ that replacing each rule

\[ p(x_0, x) \leftarrow p_0(\_, x_0),\ \mathrm{subelem}_{\pi}(x_0, x),\ C,\ p_1(\_, x_1),\ \dots,\ p_n(\_, x_n). \]

of $\mathcal{P}$ (where $C$ is a set of condition atoms) by

\[ p(x_0, x) \leftarrow p'_0(x_0),\ \mathrm{subelem}_{\pi}(x_0, x),\ C,\ p'_1(x_1),\ \dots,\ p'_n(x_n). \]

does not change the meaning of the program. But then, if we only want to compute
the unary versions of the pattern predicates, we can just as well replace the
heads $p(x_0, x)$ by $p'(x)$. This leads to a monadic datalog program over
$\tau_{ur} \cup \{\mathrm{subelem}_{\pi,\rho}\}$. The theorem now follows immediately from
Lemma 2.2 and Proposition 3.3 in [11]. □

The rationale of supporting binary pattern predicates in Elog is to explicitly 
build the edge relation of an output graph during the wrapping process. The obvious 
unfolding of this directed graph into a tree is what we consider the result of a 
wrapper run. 

Definition 2.6 The output language of Elog⁻₂ is defined as follows. An Elog⁻₂
program $\mathcal{P}$ defines a function mapping each document $t$ to a node-labeled directed
graph

\[ G = \bigl\langle V = \mathrm{dom}_t,\ E = \{(v_1, v_2) \mid p_i(v_1, v_2) \in \mathcal{T}^\omega_{\mathcal{P}}\},\ (Q_p)_{p \in P} \bigr\rangle \]

where $Q_p = \{v \mid (\exists v')\ p(v', v) \in \mathcal{T}^\omega_{\mathcal{P}}\}$ and $P$ is the set of
pattern predicate names occurring in $\mathcal{P}$. □

The edge relation E constitutes a partial order of the nodes. The graph is acyclic 
except for loops of the form $(v, v) \in E$, which are due to specialization rules that
produce such loops. In all other rules with a head p(x,y), y matches only nodes 
strictly below the nodes matched by x in the tree. 

Lemma 2.7 Each Elog⁻₂ binary pattern predicate is definable in MSO.

Proof. Let $\mathcal{P}$ be an Elog⁻₂ program and $r$ be a rule of $\mathcal{P}$ with head
$P(x_0, x)$, undistinguished variables $x_{j_1}, \dots, x_{j_l}$ (i.e., $x_0$ and $x$ are
precisely the variables of rule $r$ not contained in this list), and a body that
consists of the pattern atoms $P_{i_1}(\_, x_{i_1}), \dots, P_{i_m}(\_, x_{i_m})$ and the set
$B$ of remaining atoms.

We use the representation of $\mathcal{P}$ as a monadic program $\mathcal{P}'$ as described in
the proof of Theorem 2.5 to define a formula $\varphi$ such that $\varphi(w_1, \dots, w_m)$ is
true iff there exist nodes $v_1, \dots, v_m$ such that the atoms
$P_{i_1}(v_1, w_1), \dots, P_{i_m}(v_m, w_m)$ evaluate to true on the input tree. Let

\[ \varphi(x_{i_1}, \dots, x_{i_m}) := (\forall P_1) \cdots (\forall P_n)\ \mathrm{SAT}(P_1, \dots, P_n) \rightarrow \bigl(P_{i_1}(x_{i_1}) \wedge \cdots \wedge P_{i_m}(x_{i_m})\bigr) \]

where SAT is obtained from $\mathcal{P}'$ as shown in the proof of Proposition 3.3 in [11].
Clearly,

\[ P_r(x_0, x) := (\exists x_{j_1}) \cdots (\exists x_{j_l})\ \varphi(x_{i_1}, \dots, x_{i_m}) \wedge B \]

is equivalent to the relation defined by the single rule $r$ in $\mathcal{P}$.

Now let $\mathcal{P}_P \subseteq \mathcal{P}$ be the set of rules in the input program whose head
predicate is $P$. The formula

\[ P(x_0, x) := \bigvee_{r \in \mathcal{P}_P} P_r(x_0, x) \]

defines the desired relation of pattern predicate $P$. □
Theorem 2.8 The relations of G are MSO-definable.

Proof. Let $\mathcal{P}$ be an Elog⁻₂ program. The edge relation $E$ is simply the union of
the relations defined by each of the pattern predicates in $\mathcal{P}$, i.e. a disjunction
of their MSO formulae that we have constructed in the proof of Lemma 2.7. The
MSO-definability of the $Q_p$ relations was shown in Theorem 2.5. □

We have seen that Elog⁻ has linear-time data complexity (see Theorem 4.1). The
fixpoint of an Elog⁻₂ program (or an Elog program) and equally the edge relation of
the output graph, however, can be of quadratic size.

Example 2.9 Let $t$ be a tree where all leaves are labeled "l", while all other nodes
are labeled "b". The single-rule program

\[ p(x_0, x) \leftarrow \mathrm{dom}(\_, x_0),\ \mathrm{subelem}_{\pi = \_^{*}.l,\ \rho = *}(x_0, x). \]

evaluates to a fixpoint of quadratic size at worst. For instance, consider a tree with
branch nodes $b_1, \dots, b_m$ and leaf nodes $l_1, \dots, l_n$ such that $b_i$ is the parent
of $b_{i+1}$ (for $1 \le i < m$) and $b_m$ is the parent of $l_1, \dots, l_n$. Here, the binary
relation defined by $p$ is $\{(b_i, l_j) \mid 1 \le i \le m,\ 1 \le j \le n\}$. □
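
The size of this relation can be checked on small instances with a short sketch. It builds the chain-shaped tree of the example as a child-to-parent map and collects all (ancestor, leaf) pairs directly, rather than running an Elog engine; the function names and the node naming scheme are assumptions made for the illustration, and it assumes m ≥ 1 and n ≥ 1.

```python
def chain_tree(m: int, n: int) -> dict:
    """Child -> parent map: b1 > b2 > ... > bm, and bm is the parent of l1..ln."""
    parent = {}
    for i in range(2, m + 1):
        parent[f"b{i}"] = f"b{i - 1}"
    for j in range(1, n + 1):
        parent[f"l{j}"] = f"b{m}"
    return parent


def ancestor_leaf_pairs(parent: dict) -> set:
    """All pairs (v0, v) where v is a leaf strictly below node v0."""
    nodes = set(parent) | set(parent.values())
    inner = set(parent.values())
    pairs = set()
    for leaf in nodes - inner:
        anc = parent.get(leaf)
        while anc is not None:         # walk up to the root
            pairs.add((anc, leaf))
            anc = parent.get(anc)
    return pairs


tree = chain_tree(3, 4)
relation = ancestor_leaf_pairs(tree)   # 3 * 4 pairs on 3 + 4 nodes
```

With m = n, the tree has 2n nodes while the relation has n² pairs, which is the quadratic blow-up the example describes.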

Remark 2.10 Note that in full Elog as currently implemented, a range $[\rho]$ can be
put at the end of each rule, such that a rule

\[ p(x_0, x) \leftarrow p_0(\_, x_0),\ \mathrm{subelem}_{\pi}(x_0, x),\ C,\ R\ [\rho]. \]

has the meaning that $p(v_0, v)$ is inferred from this rule if $v \in S[\rho]$, where
$v' \in S$ iff there is an assignment of the variables in the body of the rule to nodes
that renders the body true and $x_0$ is assigned to $v_0$ and $x$ to $v'$. □

3 Other Wrapping Languages 

In this section, we compare the expressiveness of two further wrapping languages,
namely regular path queries with nesting and HEL, the wrapping language of the
W4F framework [18], to Elog⁻₂.

Other previously proposed wrapping languages were evaluated as well. The
majority of previous work is string-based (e.g., TSIMMIS QH, EDITOR 0, FLORID
[15], DEByE [12], and Stalker ^]), and artificially restricting these languages in
some way to work on trees would not be true to their motivation. Thus, we decided
not to include them in this discussion. For some other systems (such as XWrap
[14], which is essentially tree-based like W4F or Lixto), no formal specifications
have been published which can be made subject to expressiveness evaluations.
Web query languages were also evaluated, but some (e.g., WebSQL [3], WebLOG)
are unsuitable for wrapping because they cannot access the structure of Web
documents, and others 2 (e.g., WebOQL [3j) are highly expressive query languages
that permit data transformations not in the spirit of wrapping.

3.1 Regular Path Queries with Nesting (RPN) 

The first language we compare to Elog⁻₂ is obtained by combining regular path
queries [2] with nesting to create complex structures. This new language - which 
we will call RPN (Regular Path queries with Nesting) - on one hand is simple yet 
appropriate for defining practical wrappers, and on the other hand serves to prepare 
some machinery for comparing further wrapping languages later on. 

2 For a survey of further Web query languages see 
Definition 3.1 The syntax of RPN is defined by the grammar

rpn:     patom '.' rpn | 'txt' | '(' rpn '#' ··· '#' rpn ')'
patom:   patom0 | patom0 conds
patom0:  path | path '[' range ']'
range:   range0 ';' ··· ';' range0
conds:   '{' cond 'and' ··· 'and' cond '}'
cond:    patom '.' cond | 'txt' '=' string

where rpn is the start production, a "range0" is either '*', i, or i-j (where i and
j are integers), "path" denotes the regular expressions over HTML tag names, and
"string" the set of strings. □

Example 3.3 below shows an RPN wrapper in this syntax.

Definition 3.2 (Denotational semantics of RPN) Let $\pi$ denote a path, $\rho$ a
range, $s$ a string, and $v, v'$ tree nodes. Without loss of generality, we assume that
every patom has a range 3 . The semantics function $E$ maps, given a tree, each pair
of an RPN statement $W$ and a node to a complex object as follows:

\[ E[\![\pi[\rho]\{Y_1 \text{ and } \dots \text{ and } Y_n\}.X]\!]v := \bigcup \{ E[\![X]\!]v' \mid \mathrm{subelem}_{\pi,\rho}(v, v') \text{ is true} \wedge C[\![Y_1]\!]v' \wedge \cdots \wedge C[\![Y_n]\!]v' \} \]
\[ E[\![(X_1 \# \dots \# X_n)]\!]v := \{ \langle E[\![X_1]\!]v, \dots, E[\![X_n]\!]v \rangle \} \]
\[ E[\![\mathrm{txt}]\!]v := \{ v.\mathrm{txt} \} \]

Here, $v.\mathrm{txt}$ denotes the string value of a node, the concatenation of all text below
node $v$ in the input document. Above we assume that both a range description $\rho$
and $n$ conditions are present in a patom, but it is clear how to handle the cases
where either one or both are missing.

This definition makes use of the semantics function

\[ C : L(\mathit{cond}) \rightarrow \mathrm{dom} \rightarrow \mathrm{Boolean} \]

for RPN conditions, which we define as follows.

\[ C[\![\pi[\rho]\{Y_1 \text{ and } \dots \text{ and } Y_n\}.X]\!]v := (\exists v')\ \mathrm{subelem}_{\pi,\rho}(v, v') \text{ is true} \wedge C[\![Y_1]\!]v' \wedge \cdots \wedge C[\![Y_n]\!]v' \wedge C[\![X]\!]v' \]
\[ C[\![\mathrm{txt} = s]\!]v := \text{if } v.\mathrm{txt} = s \text{ then true else false} \]

Given a tree $t$, an RPN statement $X$ evaluates to $E[\![X]\!]\,\mathrm{root}_t$. □

RPN statements can be strongly typed. It is easy to verify that an RPN
statement $W$ evaluates to a complex object of type $T[\![W]\!]$ on all trees, where

\[ T[\![\mathit{patom}.X]\!] := T[\![X]\!] \]
\[ T[\![(X_1 \# \dots \# X_n)]\!] := \{ \langle T[\![X_1]\!], \dots, T[\![X_n]\!] \rangle \} \]
\[ T[\![\mathrm{txt}]\!] := \{ \mathrm{String} \} \]

for rpn statements $X, X_1, \dots, X_n$.

Example 3.3 The RPN statement

html.body.table.tr{td[0].txt = "item" }.td[1].txt

selects the second entries ("td[1]") of table rows ("table.tr") whose first entries have
text value "item". The type of this statement is

T[html.body.table.tr{td[0].txt = "item" }.td[1].txt] = {String}

Note in particular that the type is not {{String}}, even though there are two
patoms not counting the condition! □

3 We can always add a range [*] to a patom without a range without changing the semantics.
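
To make the semantics functions E and C more concrete, here is a minimal evaluator sketch for a restricted form of RPN. It is an illustration under several assumptions that are not part of the definition: paths are single child steps by tag name, ranges are sets of 0-based indices (or `None` for `*`), statements are nested tuples built with the hypothetical helper `P`, and evaluation starts at a synthetic document node placed above the `html` element.

```python
from dataclasses import dataclass, field


@dataclass(eq=False)
class Node:
    tag: str
    children: list = field(default_factory=list)
    text: str = ""

    def txt(self) -> str:
        """String value: concatenation of all text at and below this node."""
        return self.text + "".join(c.txt() for c in self.children)


def step(v, tag, rng):
    """Children of v with the given tag, restricted by the range (0-based)."""
    matches = [c for c in v.children if c.tag == tag]
    return [c for i, c in enumerate(matches) if rng is None or i in rng]


def C(cond, v) -> bool:
    """Condition semantics: some matching node satisfies the rest of the condition."""
    if cond[0] == "txt=":
        return v.txt() == cond[1]
    _, tag, rng, conds, rest = cond
    return any(all(C(y, w) for y in conds) and C(rest, w)
               for w in step(v, tag, rng))


def E(stmt, v) -> frozenset:
    """Statement semantics: union over matching nodes; records become tuples."""
    if stmt[0] == "txt":
        return frozenset({v.txt()})
    if stmt[0] == "rec":
        return frozenset({tuple(E(x, v) for x in stmt[1])})
    _, tag, rng, conds, rest = stmt
    out = set()
    for w in step(v, tag, rng):
        if all(C(y, w) for y in conds):
            out |= E(rest, w)
    return frozenset(out)


def P(tag, rest, rng=None, conds=()):
    """Build one patom followed by the rest of the statement."""
    return ("step", tag, rng, list(conds), rest)


TXT = ("txt",)
row = lambda a, b: Node("tr", [Node("td", text=a), Node("td", text=b)])
doc = Node("#doc", [Node("html", [Node("body", [Node("table", [
    row("item", "apple"), row("other", "pear"), row("item", "plum")])])])])

# html.body.table.tr{td[0].txt = "item"}.td[1].txt
stmt = P("html", P("body", P("table", P(
    "tr", P("td", TXT, rng={1}),
    conds=[P("td", ("txt=", "item"), rng={0})]))))
result = E(stmt, doc)
```

The result is a flat set of strings, in line with the observation that the type of the statement is {String} and not {{String}}.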

Remark 3.4 Note that the semantics of paths and conditions in RPN is similar
to the semantics of a fragment obtained from XPath by prohibiting most of
its function library (and therefore its arithmetic and string manipulation features).
The simple RPN wrapper of Example 3.3 is basically equivalent to the XPath query

/html/body/table/tr[td[1] = "item"]/td[2].

A path of the form ···//a/··· in XPath corresponds to ···._*.a.··· in RPN.

The main difference is that while XPath selects nodes of the input tree, RPN
extracts text below nodes rather than selecting the nodes themselves. Another
significant difference between XPath and RPN is that RPN statements may create
complex objects (using the nesting construct) that cannot be built in XPath. □

Next, we will show that each wrapper expressible in RPN is also expressible in
Elog⁻₂. Clearly, there is a mismatch between the forms of output Elog⁻₂ and RPN
produce which needs to be discussed first. The former language produces trees while
the latter produces complex objects containing records.

In the following, we will require Elog⁻₂ programs to be of a special form that
allows for a canonical mapping from the binary atoms computed by an Elog⁻₂ program
to a complex object.

Given an RPN statement W, each predicate must be uniquely associated to one 
set or record entry subterm of the type term T[W]. 

• For a predicate $p$ that is associated to a distinguished set of $T[\![W]\!]$, an atom
$p(v, w)$ asserts that node $w$ is in a set of the output object uniquely identified
by $p$ and $v$.

• For a predicate $p$ that is associated to a distinguished (set-typed) record entry
of $T[\![W]\!]$, an atom $p(v, w)$ asserts that $w$ is an element of a record entry in the
output object uniquely identified by $p$ and $v$.

By ordering predicates 4 defining the entries of an RPN record appropriately and
mapping two nodes $w_1, w_2$ such that there is a node $v$ and two edges $(v, w_1)$, $(v, w_2)$
labeled with the same predicate into a common set, we obtain the desired mapping
to the complex object model of RPN.
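
The grouping step of this mapping can be sketched as follows. Atoms are represented as triples `(predicate, v, w)`, edges with the same predicate and the same parent node are collected into one set, and a record for a node is assembled by listing its entry predicates in a chosen order. The function names and the sample predicate names (`row`, `name`, `price`) are hypothetical and serve only as an illustration.

```python
from collections import defaultdict


def group_atoms(atoms):
    """Map (predicate, parent node) to the set of nodes w with an atom p(v, w)."""
    sets = defaultdict(set)
    for pred, v, w in atoms:
        sets[(pred, v)].add(w)
    return dict(sets)


def record_for(groups, v, ordered_preds):
    """Record at node v: one set-typed entry per predicate, in the given order."""
    return tuple(frozenset(groups.get((p, v), ())) for p in ordered_preds)


atoms = {
    ("row", "root", "r1"), ("row", "root", "r2"),
    ("name", "r1", "n1"), ("price", "r1", "p1"),
    ("name", "r2", "n2"),
}
groups = group_atoms(atoms)
rec_r1 = record_for(groups, "r1", ["name", "price"])
rec_r2 = record_for(groups, "r2", ["name", "price"])
```

An entry with no matching atoms simply becomes an empty set, so every record has the same shape regardless of the input tree.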

It can be easily argued that the distinction between the complex object model of
RPN and the output of an Elog⁻₂ program that satisfies the above-designed semantics
is only cosmetic, indeed that what we produce is a canonical representation of a
complex object data model by binary atoms.

RPN also produces string values, while we have not discussed the form in which
a tree node is output in Elog so far. We assume that the output of Elog for a
node is the concatenation of all text below the node in the document tree. We also
assume that text strings are accessible in the document tree (say, string "text" is
represented as the path-shaped subtree t → e → x → t → ⊥) and can be checked using
the predicate $\mathrm{contains}_{\pi,\rho}$. (For instance, we can check whether a node $x$ has
string value "text" using $\mathrm{contains}_{.t.e.x.t.\bot,\,*}(x, y)$, where $y$ is a dummy
variable.)

4 Note that in the reference implementation of Elog (the Lixto system |7||H]), an ordering of 
pattern predicates can be defined such that edges of the tree unfolding of the output graph with 
a common parent node are ordered by their predicate (rather than by document order). 
Theorem 3.5 For each wrapper expressible in RPN, there is an equivalent wrapper
in Elog⁻₂.

Proof. Ranges in RPN are regular and can be encoded using the
$\mathrm{subelem}_{\pi,\rho}$ and $\mathrm{contains}_{\pi,\rho}$ predicates. Clearly, each RPN range
can be easily encoded as an Elog⁻₂ range. Without loss of generality, let $W$ be an RPN
statement in which every patom has a range. We create the Elog⁻₂ program
$\mathcal{P} := \mathcal{P}_E[\![W]\!](\mathrm{root})$ using the function $\mathcal{P}_E$ which maps each
pair of an RPN statement and a "context" predicate to an Elog⁻₂ program, and which is
defined as follows.

\[ \mathcal{P}_E[\![\pi[\rho]\{X_1 \text{ and } \dots \text{ and } X_n\}.Y]\!](p) := \{\, p'(x_0, x) \leftarrow p(\_, x_0),\ \mathrm{subelem}_{\pi,\rho}(x_0, x),\ r_1(\_, x),\ \dots,\ r_n(\_, x). \,\} \cup \mathcal{P}_C[\![X_1]\!](r_1) \cup \cdots \cup \mathcal{P}_C[\![X_n]\!](r_n) \cup \mathcal{P}_E[\![Y]\!](p'), \]
\[ \mathcal{P}_E[\![(X_1 \# \dots \# X_n)]\!](p) := \mathcal{P}_E[\![X_1]\!](p) \cup \cdots \cup \mathcal{P}_E[\![X_n]\!](p), \]
\[ \mathcal{P}_E[\![\mathrm{txt}]\!](p) := \emptyset, \]

where $X_1, \dots, X_n, Y$ are RPN statements, $n \ge 0$, $\pi$ is a path, $\rho$ is a range, and
$p', r_1, \dots, r_n$ are new predicates.

As an auxiliary function for conditions, we have $\mathcal{P}_C$, defined as

\[ \mathcal{P}_C[\![\pi[\rho]\{X_1 \text{ and } \dots \text{ and } X_n\}.Y]\!](p) := \{\, p(x_0, x) \leftarrow \mathrm{dom}(x_0, x),\ \mathrm{contains}_{\pi,\rho}(x, y),\ r_1(\_, y),\ \dots,\ r_n(\_, y),\ s(\_, y). \,\} \cup \mathcal{P}_C[\![X_1]\!](r_1) \cup \cdots \cup \mathcal{P}_C[\![X_n]\!](r_n) \cup \mathcal{P}_C[\![Y]\!](s) \]
\[ \mathcal{P}_C[\![\mathrm{txt} = s]\!](p) := \{\, p(x_0, x) \leftarrow \mathrm{dom}(x_0, x),\ \mathrm{contains}_{s}(x, y). \,\} \]

where $s$ is a string.

A number of predicates generated in this way may correspond to patoms that are
followed by further patoms in the RPN statement $W$ and for which no corresponding
set exists in $T[\![W]\!]$ (see Example 3.3).

The reference implementation of Elog, Lixto, allows pattern predicates of a given
Elog program to be declared auxiliary. Atoms $p(v_0, v)$ of such predicates are then
removed from the result of a wrapper run such that if atom $p'(v, w)$ has also been
inferred, we add $p'(v_0, w)$ (closing the "gap" produced by dropping the auxiliary
predicate).

It is easy to see that the described mapping produces an Elog⁻₂ program that,
when auxiliary predicates are eliminated in this way, maps canonically to RPN
complex objects. □

Theorem 3.6 There is an Elog⁻₂ wrapper for which no equivalent RPN wrapper
exists.

Proof. For trees of depth one, all RPN queries are first-order. We therefore cannot
check whether, say, the root node has an even number of children, which we can do
in MSO and thus, by Theorem 2.5, in Elog⁻₂. □

3.2 HTML Extraction Language (HEL) 

In this section, we compare the expressive power of the HTML Extraction Language
(HEL) of the World Wide Web Wrapper Factory (W4F) with the expressiveness of
Elog⁻₂. For an introduction to and a formal specification of HEL see [18].

Defining the semantics of HEL is a tedious task. (The denotational semantics
provided in [18] takes nearly nine pages and does not yet cover all features!) Here,
we proceed in three stages to cover HEL reasonably well. We will define a fragment
of HEL called HEL⁻ which drops a number of marginal features, and introduce a
slightly simplified version of it, HEL⁻_vf, which does not use HEL's index variables.
HEL⁻_vf has the desirable property that the semantics of HEL⁻ and HEL⁻_vf entail a
one-to-one relationship between wrappers in the two languages. This variable-free
syntax is possible because of the very special and restricted way in which
index variables may be used in HEL. For simplicity, we first introduce HEL⁻_vf and
subsequently HEL⁻. Finally, we discuss the remaining features of HEL.

Let RPN⁻ be the fragment of RPN obtained by requiring that all patoms are
restricted to the form t or _*.t (we will write the latter as → t), where t is a tag,
and conditions may not be nested inside conditions.

The language HEL⁻_vf (that is, variable-free HEL⁻) differs from RPN⁻ semantically
in that ranges apply only to those nodes for which all given conditions hold
(i.e., intuitively, conditions are evaluated "first").

Let $\pi$ be either .t or → t (where t is a tag), and let $\pi_1, \dots, \pi_m$ be paths without
conditions. We denote the HEL⁻_vf semantics function $H$ (with ranges) by

\[ H[\![\pi[\rho]\{\pi_1.\mathrm{txt} = s_1 \wedge \cdots \wedge \pi_m.\mathrm{txt} = s_m\}.X]\!]v := \{ H[\![X]\!]w \mid w \in R_\rho(E[\![\pi\{\pi_1.\mathrm{txt} = s_1 \wedge \cdots \wedge \pi_m.\mathrm{txt} = s_m\}]\!]v) \} \]

where $E$ is the RPN semantics function and $R_\rho(V)$ denotes the set of nodes of $V$
matching the range w.r.t. document order, e.g. for range $i$

\[ R_i(V) := \{ y_i \mid (\exists y_0) \cdots (\exists y_{i-1})\ y_0, \dots, y_i \in V \wedge \neg\exists y_{-1} \in V : y_{-1} \prec y_0 \wedge \bigwedge_{0 \le k < i} (y_k \prec y_{k+1} \wedge \neg\exists y' \in V : y_k \prec y' \prec y_{k+1}) \}. \]

This selects the (i + 1)-th node of $V$. (In HEL, the index of the first node is 0.)

On the remaining forms of HEL⁻_vf statements, $(X_1 \# \dots \# X_n)$ and txt, $H$ is
defined analogously to $E$.
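
The difference in evaluation order between the two semantics can be illustrated with a small sketch. It assumes the nodes matched by a path are given as a list in document order, a range is a set of 0-based indices, and a condition is a Python predicate; `rpn_select` and `hel_select` are hypothetical names for the two readings.

```python
def rpn_select(nodes, rng, cond):
    """RPN reading: the range indexes all path matches, then conditions filter."""
    return [v for i, v in enumerate(nodes) if i in rng and cond(v)]


def hel_select(nodes, rng, cond):
    """HEL reading: conditions filter first, the range indexes the survivors."""
    kept = [v for v in nodes if cond(v)]
    return [v for i, v in enumerate(kept) if i in rng]


rows = [("header", 1), ("item", 2), ("item", 3)]   # (first cell text, row id)
is_item = lambda row: row[0] == "item"

rpn_result = rpn_select(rows, {0}, is_item)   # row 0 is the header: nothing selected
hel_result = hel_select(rows, {0}, is_item)   # first row that satisfies the condition
```

With range 0 and the condition "first cell is item", the RPN reading returns nothing because the first row fails the condition, whereas the HEL reading returns the first row that passes it.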

Theorem 3.7 (1) For each wrapper expressible in the HEL⁻_vf language, there is an
equivalent wrapper in Elog. (2) There is an Elog⁻₂ wrapper for which no equivalent
HEL⁻_vf wrapper exists.

Proof. (1) can be shown using essentially the same proof as that of Theorem 3.5
with the difference that we use a feature of Elog (see e.g. |B]) that allows us to put
ranges on the nodes over which the variable $x$ ranges (relative to $x_0$) and replace
rules

\[ p'(x_0, x) \leftarrow p(\_, x_0),\ \mathrm{subelem}_{\pi,\rho}(x_0, x),\ r_1(\_, x),\ \dots,\ r_n(\_, x). \]

by

\[ p'(x_0, x) \leftarrow p(\_, x_0),\ \mathrm{subelem}_{\pi}(x_0, x),\ r_1(\_, x),\ \dots,\ r_n(\_, x)\ [\rho]. \]

(2) can be justified by the same argument used previously for showing
Theorem 3.6. □

Next we discuss the HEL⁻ language, a proper fragment of HEL. The syntax of
HEL⁻ is considerably different from that of HEL⁻_vf, using a form of index variables
in ranges and a special "where" block at the end of a wrapper statement that collects
all of the conditions, similar to database query languages such as SQL. To give a
better overview of the language, we provide its full syntax.

Definition 3.8 The syntax of the language HEL⁻ is defined by the following grammar.

HEL⁻:    cc | cc 'where' conds
cc:      pseq '.txt' | pseq '(' cc '#' ··· '#' cc ')'
pseq:    patom (('.' | '→') patom)*
patom:   tag | tag '[' vrange ']'
vrange:  range | var ':' range | var
conds:   cond 'and' ··· 'and' cond
cond:    pseq '.txt' '=' string

where "var" is a set of index variable names, "int" is the set of integers, "tag" the
set of HTML tag names, "string" the set of strings, and range is defined as in RPN
(see Definition 3.1). □


There are a number of further syntactical conditions that restrict the way in
which variables can be used in a wrapper. Each index variable used in a HEL⁻
statement occurs exactly once in its cc construct. Moreover, let P be the set of paths
that can be constructed by concatenating paths in the cc construct starting from
the left and always choosing one element of a record while going to the right. Each
cond construct c in the where clause of a wrapper is constrained in that the smallest
prefix of c that contains all ranges with index variables has to match a prefix of a
path (in terms of both tags and index variables appearing in ranges) in P.

For example,

html.body.table(tr[0].td[0].txt # tr[i:*].td[1].txt)
where html.body.table.tr[i].td[0].txt = "item";

is correct HEL, because we can construct the path html.body.table.tr[i:*].td[1].txt
while reading the cc construct from left to right, and this path and its index variables
match the condition. (There is a single index variable i occurring in both paths at
the same position, and the prefix html.body.table.tr is the same.)

The semantics of HEL⁻ will not be introduced in detail, but index variables
are simply a tool to relate paths in the first "construction" part of the wrapper
(everything up to the where clause) with conditions in the second part.

A HEL⁻ wrapper can be easily transformed into HEL⁻_vf by simply removing its
conditions one by one and merging them into the construction part of the wrapper.
Starting from the left, each condition is deleted up to the rightmost of its variables,
and the remaining condition is nested into the construction part of the wrapper at
the position of that variable. For example, the HEL wrapper shown above can be
written as

html.body.table(tr[0].td[0].txt # tr[*]{td[0].txt = "item" }.td[1].txt);

in HEL⁻_vf.

Proposition 3.9 ([6]) A wrapper is expressible in HEL⁻ iff it is expressible in
HEL⁻_vf.

Therefore, HEL⁻ inherits the expressiveness results of Theorem 3.7.

HEL⁻ is the fragment of HEL obtained by taking HEL without string extraction
using match and split expressions (although we support strings in conditions as
essential to the philosophy of HEL) and without the getNumberOf and getAttr
functions. Note that this is done to compare HEL in our framework based on
the language Elog⁻₂. Full Elog again supports string extraction in the way HEL
does. Using the getNumberOf function of HEL, one may require that the number
of nodes (in the document tree) reachable through a given path starting from some
node is equal to some constant number, which is easy to define in MSO. The getAttr
function of HEL extracts HTML attributes, which we manage as tree nodes. In our
framework, the function is redundant with those for accessing nodes.

Some HEL statements can be required to be single-valued (i.e., for a statement
$W$ relative to node $v$, $E[\![W]\!]v$ must contain exactly one node). This is in particular
true for condition paths, which must always be single-valued. These issues are best
handled at runtime (during complex object creation) using an exception handling
mechanism as in W4F.

Remark 3.10 HEL also supports some form of Prolog-like cut "!" with which
some conditions can be marked. The cut causes the evaluation of a path to stop if a
condition marked with the cut is false. The HEL cut, however, has not been covered
in the formal semantics definition of [18] or unambiguously explained elsewhere.
Several different meanings are imaginable.

Let us consider one meaning of the cut, where, given a node $v$, we first evaluate
the path $\pi$, and then remove all nodes $w$ that either violate a condition or for which
there is a different node $w_0$ such that $w_0$ is reachable from $v$ through $\pi$ and
$w_0 \prec w$.

We can formally denote the changed semantics of paths with conditions and the
cut (but without ranges) by a semantics function $H_0$ such that

\[ H_0[\![\pi\{\pi_1.\mathrm{txt} = s_1 \wedge \cdots \wedge \pi_m.\mathrm{txt} = s_m\}]\!]v := \{ z \mid \mathrm{subelem}_\pi(v, z) \wedge C(z) \wedge (\forall x)\, (x \prec z \wedge \mathrm{subelem}_\pi(v, x)) \rightarrow C'(x) \} \]

where $C(v) := \bigwedge_{1 \le k \le m} C[\![\pi_k.\mathrm{txt} = s_k]\!]v$ and $C'(v)$ holds if for all
conditions $\pi_k.\mathrm{txt} = s_k$ with the cut, $C[\![\pi_k.\mathrm{txt} = s_k]\!]v$ is true. 5 This
semantics function $H_0$ can easily be integrated into the above-described function $H$
to cover ranges as well.

This essentially provides us with a definition of the edge relation that determines
the complex objects computed by full HEL wrappers in MSO (see the previous
section where we have discussed the relationship between such a binary relation
and complex objects). It follows that all unary HEL queries (for any reasonable
definition of such queries) are definable in Elog⁻. □
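
One possible reading of the cut semantics considered in Remark 3.10 can be sketched procedurally as follows. The sketch assumes the path matches are given in document order, that conditions are Python predicates, and that the cut-marked conditions are a subset of all conditions; `cut_select` is a hypothetical name. A node is kept if it satisfies every condition and no strictly earlier match violates a cut-marked condition.

```python
def cut_select(matches, conds, cut_conds):
    """Select nodes z with C(z) such that C'(x) holds for every earlier match x.

    matches   -- nodes reachable through the path, in document order
    conds     -- all conditions (callables node -> bool), including cut-marked ones
    cut_conds -- the subset of conditions marked with the cut
    """
    selected = []
    blocked = False                    # set once an earlier node fails a cut condition
    for z in matches:
        if not blocked and all(c(z) for c in conds):
            selected.append(z)
        if not all(c(z) for c in cut_conds):
            blocked = True             # all later nodes are cut off
    return selected


is_even = lambda n: n % 2 == 0
below_ten = lambda n: n < 10

with_cut = cut_select([2, 4, 5, 6], [is_even, below_ten], [is_even])
without_cut = cut_select([2, 4, 5, 6], [is_even, below_ten], [])
```

Since the remark notes that several meanings of the cut are imaginable, this sketch should be read as an illustration of one of them rather than as the behaviour of W4F.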